Nathaniel Dake Blog

3. Naive Bayes Classification

This is an article that I have been excited to write for quite some time! While the model and classification techniques are nothing new, Naive Bayes (and its Bayes Net generalizations) builds upon all of the intuitions that I have laid out in past posts on probability theory, classic frequentist statistical inference, and Bayesian inference. I highly recommend that you take some time to read them over if your background in statistics, probability, and Bayesian inference is rusty (I will particularly be assuming the reader is comfortable with Bayesian inference).

With that said let's get started!

1. Setting the Stage: A Classification Problem

I want to begin by introducing a simple toy classification problem that can help create a holistic picture of what we are trying to accomplish. Imagine that you are working as a Data Scientist for an energy company, and you are trying to predict whether a given customer will be a "big consumer" of energy that month, based on their energy consumption after 10 days. Now, "big consumer" is clearly a rather arbitrary concept, but let's say that you have spoken to management and they classify a big consumer as anyone using over 2000 kWh in a given month. To make this a bit more clear, we may have historical data that looks something like:

| Month | Energy Usage (After 10 Days) | Energy Usage (End of Month) | Big Spender? |
|-------|------------------------------|-----------------------------|--------------|
| 1     | 1200                         | 3000                        | Y            |
| 2     | 450                          | 890                         | N            |
| 3     | 600                          | 1745                        | N            |
| 2     | 800                          | 2100                        | Y            |
| 1     | 100                          | 290                         | N            |
| 3     | 724                          | 2020                        | Y            |
| 1     | 1800                         | 1950                        | N            |
| ...   | ...                          | ...                         | ...          |

As a side note, there are many reasons that we may want to classify a consumer as a big spender early in the month, particularly if we want to offer them insights into why they are consuming more energy, tips as to how they can reduce energy consumption, or a paper report in the mail (proven to be more engaging, while also the most costly form of engagement) that may help them reduce energy usage and hence their current bill. Reasons aside, the question that arises is: after 10 days, how do you classify users as big spenders?

Of course there are many options! We could train a logistic regression model to find the pattern between our inputs, $X$ (energy usage after 10 days), and our output, $Y$ (big spender?), or we could fit a decision tree to the data, or a neural network; our choices are endless. As the data scientist, your job is to understand the context of the problem, edge cases, business requirements, etc., and select the most appropriate model. In this case, considering the subject of this post, we will choose to use a Naive Bayes model.

Note, I want to again make clear that this is of course a toy problem. Training any model to make robust predictions on a single feature is a tall order, working well only if that feature happens to explain a large portion of the variance in the dependent variable$^1$ (for more on this see my post on dimensionality reduction, particularly the example pertaining to height vs. urefu, where one almost perfectly explains the other).

With that said, let's take a look at some of this data (I want to note that as always you can choose to show or hide the code that makes up this post by scrolling to the bottom and clicking the corresponding button). We can see our two groups (big spenders and not big spenders) shown below:

In [1]:
import numpy as np
from scipy.stats import bernoulli, binom, norm, multivariate_normal
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter

from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
In [31]:
big_spender_mean = 2000
big_spender_std = 800
num_big_spenders = 10000
big_spender_usage_ten_days = np.random.normal(big_spender_mean, big_spender_std, num_big_spenders)

low_spender_mean = 200
low_spender_std = 100
num_low_spenders = 20000

low_spender_usage_ten_days = np.random.normal(low_spender_mean, low_spender_std, num_low_spenders)
In [9]:
trace1 = go.Histogram(
    x=big_spender_usage_ten_days,
    nbinsx=100,
    name="Big Spender",
    marker_color='blue',
)

trace2 = go.Histogram(
    x=low_spender_usage_ten_days,
    nbinsx=100,
    name="Low Spender",
    marker_color='crimson',
)

data = [trace1, trace2]

layout = go.Layout(
    width=750,
    height=450,
    title="Big Spender vs. Non Big Spender Distribution",
    xaxis=dict(title="Usage after 10 days (kWh)"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can calculate a few descriptive statistics from the above data and see that our low spenders have a mean of ~$200$ kWh of energy used after 10 days, while our big spenders have a mean of ~$2000$ kWh and a much higher standard deviation (spread) of $800$ kWh. For the sake of this problem our data was generated via a Gaussian process (via numpy.random.normal), but in reality we could expect to see more of a long-tail distribution for our big spenders. In other words, a business insight we could glean is that big spenders will consist mainly of homes that have used one to two thousand kWh of energy after 10 days, plus a few rare cases of large companies and office facilities using twenty times more energy after 10 days. Data that is generated via a Gaussian process does not allow for that type of spread, so the real-world process surely is not Gaussian in nature.

Now that we have seen the data set that we are provided to work with, let's restate the question we are trying to answer:

In the next month, how can we classify customers as either big or low spenders based on this past dataset?

In other words, next month if we are given a customer who has used $700$ kWh in the first 10 days, do we predict that they are going to be a big spender?

2. Placing our objective into a mathematical context

2.1 Bayes Rule in a univariate setting

Now, one of the main jobs of a data scientist, applied mathematician, or really any analytical problem solver, is to place a qualitative question within a mathematical framework. Here we can put this into a probabilistic context by stating that our goal is to find:

$$P(\text{Big Spender} \mid \text{Usage after 10 days})$$

Where it is common to refer to our data, usage after 10 days, as $X$, and our classification that we are trying to predict, big spender, as $Y$:

$$P(Y \mid X)$$

In English (but in a mathematical context), we can define our goal as:

Find the probability that a user is a big spender, given their first 10 days of usage.

How can we go about finding that? Well, let's actually think for a minute about how that quantity is defined in probability theory. Via the product rule$^2$, more commonly referred to as the rule of conditional probability, we can see that:

$$P(XY) = P(X \mid Y) \cdot P(Y) = P(Y \mid X) \cdot P(X)$$

And via some simple algebra we arrive at:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

What does each of these quantities represent, in English? We have:

$$P(X \mid Y) \longrightarrow \text{Probability of observing our data, } X \text{, given the consumer is a big spender}$$

$$P(Y) \longrightarrow \text{Prior probability that the user is a big spender}$$

$$P(X) \longrightarrow \text{Probability of observing the data we did (across all types of spenders)}$$

In a very general sense, this is the framework of Bayesian inference (and the equation above is often referred to as Bayes' rule). In what I find to be the most intuitive notation, we can write the equation as:

$$P(H \mid DX) = \frac{P(D \mid HX) \cdot P(H \mid X)}{P(D \mid X)}$$

Where $H$ represents a particular hypothesis that we are interested in (someone is a big spender), $D$ represents our observed data, and $X$ represents all background information that our probabilities are conditional on. Each term is generally referred to as:

$$\overbrace{P(H \mid DX)}^{\text{posterior probability}} = \frac{\overbrace{P(D \mid HX)}^\text{likelihood} \cdot \overbrace{P(H \mid X)}^\text{prior probability}}{\underbrace{P(D \mid X)}_\text{probability of data}}$$

Now for an incredibly enlightening take on probability theory as extended logic (where the above notation is taken from), I highly recommend checking out Probability Theory: The Logic of Science.
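Before moving on, the mechanics of Bayes' rule can be sanity checked with a few lines of arithmetic. The numbers below are made up purely for illustration (they are not drawn from our energy dataset):

```python
# Toy numeric check of Bayes' rule; all values here are hypothetical.
p_h = 0.2                  # prior P(H|X)
p_d_given_h = 0.05         # likelihood P(D|HX)
p_d_given_not_h = 0.4      # likelihood of the data under the alternative

# Probability of the data, marginalized over both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior P(H|DX) via Bayes' rule
p_h_given_d = p_d_given_h * p_h / p_d
```

Note how a hypothesis with a reasonable prior (0.2) ends up with a small posterior (~0.03) once we observe data that is far more likely under the alternative.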

With that said, why is this relationship useful? Well, as stated earlier, it can actually allow us to conduct a general inference. To see what I mean and to build a bit of intuition around this process, I want you to think about what $P(X \mid Y)$ means for a moment. Do we have enough information to calculate it?

Keep that thought in mind and take a look at the distribution plot below. We have essentially fit two normal distributions to our historical data, one distribution was fit to the big spenders, the other to the non big spenders:

In [11]:
hist_data = [big_spender_usage_ten_days, low_spender_usage_ten_days ]

group_labels = ['Big Spender', 'Non Big Spender']
group_colors = ['blue', 'crimson']

fig_traces = ff.create_distplot(
    hist_data, 
    group_labels,
    colors=group_colors,
    bin_size=5,
    curve_type='normal'
)

layout = go.Layout(
    autosize=False,
    width=750,
    height=450,
    title="Big Spender vs. Non Big Spender Distribution",
    xaxis=dict(title="Usage after 10 days (kWh)"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=fig_traces, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

What exactly do I mean by fit a distribution to our data? Well remember that the normal distribution is defined mathematically as:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$

And it is entirely parameterized by the mean and variance (in the univariate case). So, we can calculate the mean and variance for both of our classes respectively (big spenders and non big spenders):

$$\mu = \frac{1}{N} \sum_{i=1}^N x_i$$

$$\sigma^2 = \frac{1}{N}\sum_{i = 1}^N (x_i - \mu)^2$$

And we can then plot the corresponding normal distributions fit to those parameters. In other words, we are finding two normal distributions:

$$\text{Big Spender} \longrightarrow f(x \mid \mu_{BS}, \sigma_{BS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{BS}^2}} \exp\Big(-\frac{(x-\mu_{BS})^2}{2\sigma_{BS}^2}\Big)$$

$$\text{Low Spender} \longrightarrow f(x \mid \mu_{LS}, \sigma_{LS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{LS}^2}} \exp\Big(-\frac{(x-\mu_{LS})^2}{2\sigma_{LS}^2}\Big)$$

Where our big spender distribution is parameterized via $\mu_{BS} = 2000$ and $\sigma_{BS} = 800$, and our low spender distribution is parameterized via $\mu_{LS} = 200$ and $\sigma_{LS} = 100$.

In [32]:
big_spender_mean = big_spender_usage_ten_days.mean()
big_spender_std = big_spender_usage_ten_days.std()

low_spender_mean = low_spender_usage_ten_days.mean()
low_spender_std = low_spender_usage_ten_days.std()

usage = np.arange(-300, 5000, 1)

big_spender_normal_dist = norm.pdf(usage, big_spender_mean, big_spender_std)
low_spender_normal_dist = norm.pdf(usage, low_spender_mean, low_spender_std)
In [12]:
trace2 = go.Scatter(
    x=usage,
    y=big_spender_normal_dist,
    marker = dict(
        color = 'blue',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Big Spender Data"
)

trace1 = go.Scatter(
    x=usage,
    y=low_spender_normal_dist,
    marker = dict(
        color = 'crimson',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Low Spender Data"
)



data = [trace1, trace2]
layout = go.Layout(
    legend=dict(x=0.5, y=0.97),
    width=750,
    height=450,
    title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
    xaxis=dict(title="Usage in first 10 days of month"),
    yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

Those distributions above are $P(X \mid Y)$! In other words, we have one of the key pieces needed to find $P(Y \mid X)$. Now, how can that be useful? Well, say we are given a consumer that we want to classify; they have used $250$ kWh of energy in the first 10 days of the month. What is the likelihood, $P(X \mid Y)$, of that data point under each respective distribution? Let's take a look at the plot below:

In [72]:
trace2 = go.Scatter(
    x=usage,
    y=big_spender_normal_dist,
    marker = dict(
        color = 'blue',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Big Spender Data"
)

trace1 = go.Scatter(
    x=usage,
    y=low_spender_normal_dist,
    marker = dict(
        color = 'crimson',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Low Spender Data"
)

offset = 300
trace3 = go.Scatter(
    x=[usage[250 + offset]],
    y=[low_spender_normal_dist[250 + offset]],
    marker = dict(
        color = 'green',
        size=10
    ),
    mode="markers",
    name=f"Likelihood of observed data point, given low spender: {round(low_spender_normal_dist[250 + offset], 5)}"
)

trace4 = go.Scatter(
    x=[usage[250 + offset]],
    y=[big_spender_normal_dist[250 + offset]],
    marker = dict(
        color = 'gold',
        size=10
    ),
    mode="markers",
    name=f"Likelihood of observed data point, given big spender: {round(big_spender_normal_dist[250 + offset], 5)}"
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    legend=dict(
        x=0.5, y=0.97,
        font_size=10,
    ),
    width=750,
    height=450,
    title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
    xaxis=dict(title="Usage in first 10 days of month"),
    yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can see that, given the big spender distribution, the likelihood of a consumer using $250$ kWh of energy in the first 10 days is $0.00005$. On the other hand, the likelihood of a non big spender using $250$ kWh in the first 10 days is $0.0035$. In other words:

$$P(\text{250 kWh in first 10 days} \mid \text{Low spender}) > P(\text{250 kWh in first 10 days} \mid \text{Big spender})$$

$$P(X \mid Y_{LS}) > P(X \mid Y_{BS})$$

At this point the gears should be turning and you should start to see potential ways to create an algorithmic decision procedure based on these likelihoods. A simple procedure would be to just predict that the consumer belongs to the class corresponding to a higher likelihood.
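This simple "pick the class with the higher likelihood" rule can be sketched in a few lines. Note that the means and standard deviations below are the generating parameters used earlier in the post, assumed known here rather than estimated from data:

```python
from scipy.stats import norm

# Maximum-likelihood decision rule: evaluate the data point under each
# class-conditional density and predict the class with the larger value.
def classify_by_likelihood(usage_10_days):
    lik_big = norm.pdf(usage_10_days, loc=2000, scale=800)  # big spender: N(2000, 800)
    lik_low = norm.pdf(usage_10_days, loc=200, scale=100)   # low spender: N(200, 100)
    return "big spender" if lik_big > lik_low else "low spender"

print(classify_by_likelihood(250))  # the 250 kWh example from above
```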

However, let's slow down for a moment. Clearly, even in this simple example there is a good deal of overlap between the two distributions. In the real world there will be even more overlap, and hence more ambiguity, which the model is likely going to struggle to handle. One solution, as mentioned earlier, is to include more features. You can imagine that time of year will surely affect this prediction. For instance, if we are in the month of June, the first 10 days are likely to consume less energy than the last 10 days (because as the month goes on, the temperature rises and air conditioning is more likely to be utilized). Similarly, a weather forecast could be included, and a forecasted heat wave or cold front would clearly impact our prediction. So clearly adding more features would be beneficial; the question is, how exactly can we achieve this?

2.2 Extend to a multivariate domain

Fortunately, extending our idea of likelihood maximization is not difficult at all! We simply need to make use of the multivariate Gaussian, defined as:

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} \exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$

Where now $x$ is a vector containing each feature, $\mu$ is a vector holding the mean values of each feature, and $\Sigma$ is the covariance matrix (a $D \times D$ matrix, where $D$ is the number of features present, i.e. our dimensionality). If this all seems a little abstract, fear not; we are going to walk through an example to clear up any ambiguity. I am going to add an additional feature: the difference between the forecasted temperature on day 15 of the month and the actual temperature on day 5. This feature will serve as a proxy/heuristic for whether the remaining half of the month will be hotter or colder than the first. As a simple example, imagine the temperature on day 5 of the month was 70 degrees F, while the forecasted temperature for day 15 was 85 degrees F. This implies that the temperature in the 2nd half of the month will be greater than in the first, and hence energy usage may increase by more than a factor of 3 before the month has ended (a factor of exactly 3 would imply that each 10-day interval of the month uses the same amount of energy). Our data set will now look like:

| Month | Energy Usage (After 10 Days) | Temp Diff (F) | Energy Usage (End of Month) | Big Spender? |
|-------|------------------------------|---------------|-----------------------------|--------------|
| 1     | 1200                         | 9             | 3000                        | Y            |
| 2     | 450                          | 2             | 890                         | N            |
| 1     | 1800                         | 5             | 1950                        | N            |
| ...   | ...                          | ...           | ...                         | ...          |

From this we can calculate our distribution parameters, where again each class (big and non big spender) has its own mean, and now its own covariance:

$$\mu = \frac{1}{N} \sum_{i=1}^N x_i$$

$$\Sigma = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \mu)(x_i - \mu)^T$$

I want to emphasize again that:

Each class, big spender and non big spender, has its own mean and covariance parameters. It is the fact that these parameters differ that subsequently allows each class to have its own distribution.
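These per-class estimates are straightforward to compute with numpy. The sketch below uses hypothetical labeled samples of [usage, temp diff] standing in for our historical data, one array per class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical [usage, temp diff] samples for each class (stand-ins for the
# real labeled historical data).
big = rng.normal([2000, 7], [800, 5], size=(5000, 2))
low = rng.normal([200, 2], [100, 5], size=(5000, 2))

# Each class gets its own mean vector and covariance matrix. np.cov with
# rowvar=False treats rows as observations and applies the 1/(N-1)
# normalization from the formula above.
mu_big, sigma_big = big.mean(axis=0), np.cov(big, rowvar=False)
mu_low, sigma_low = low.mean(axis=0), np.cov(low, rowvar=False)
```

With enough samples, `mu_big` and `sigma_big` recover the generating parameters (mean usage near 2000, usage variance near $800^2$), which is exactly what "fitting" the class-conditional Gaussians means.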

If we let big_temp_diff_mean = 7 and low_temp_diff_mean = 2, we can make use of scipy's multivariate_normal.pdf to see how our distributions look in this multivariate case (note: usage was scaled down by a factor of 100 to allow the 3-d plot to render in a reasonable time):

In [104]:
fig = plt.figure(figsize=(15, 5.5), dpi=200)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

def init():
    #Parameters to set
    usage = np.arange(-3, 50, 0.5)
    temp_diff = np.arange(-20, 20, 0.5)
    usage, temp_diff = np.meshgrid(usage, temp_diff)

    big_usage_mean = 20
    big_usage_std = 8
    big_temp_diff_mean = 7
    big_temp_diff_std = 5

    low_usage_mean = 2
    low_usage_std = 1
    low_temp_diff_mean = 2
    low_temp_diff_std = 5

    pos = np.empty(usage.shape + (2,))
    pos[:, :, 0] = usage
    pos[:, :, 1] = temp_diff

    likelihood_dist_non_big_spender = multivariate_normal(
        [low_usage_mean, low_temp_diff_mean], 
        [
            [low_usage_std, 0], 
            [0, low_temp_diff_std]
        ]
    )
    likelihood_dist_big_spender = multivariate_normal(
        [big_usage_mean, big_temp_diff_mean], 
        [
            [big_usage_std, 0], 
            [0, big_temp_diff_std]
        ]
    )

    likelihood_nbs = likelihood_dist_non_big_spender.pdf(pos)
    likelihood_bs = likelihood_dist_big_spender.pdf(pos)

    # ---- LEFT: 3D Plot ----
    # Plot the surface.
    surf_low = ax1.plot_surface(
        usage,
        temp_diff,
        likelihood_nbs,
        cmap="Reds",
        linewidth=0, 
        alpha=0.5,
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 
    surf_big = ax1.plot_surface(
        usage,
        temp_diff,
        likelihood_bs,
        cmap="Blues",
        linewidth=0, 
        alpha=0.5,
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 

    # ---- RIGHT: Mesh ---- 
    c = ax2.pcolormesh(usage, temp_diff, likelihood_nbs, cmap="Reds", alpha=0.6)
    fig.colorbar(c, ax=ax2)
    c2 = ax2.pcolormesh(usage, temp_diff, likelihood_bs, cmap="Blues", alpha=0.4)
    fig.colorbar(c2, ax=ax2)             

    ax2.set_xlabel("Usage", labelpad=7)
    ax2.set_ylabel("Temp Diff", labelpad=7)
    ax2.set_title('Heat Map of Joint Distribution')

    ax1.set_xlabel('Usage', labelpad=9)
    ax1.set_ylabel('Temp Diff', labelpad=9)
    ax1.set_zlabel('Likelihood', labelpad=9)
    ax1.set_yticks([-15,-5, 5, 15])
    ax1.set_xticks([10,30,50])
    ax1.set_zticks([0.01, 0.03, 0.05, 0.07])

def animate(i):
    ax1.view_init(elev=10, azim=i*4)
    return fig,
    
plt.tight_layout()
plt.subplots_adjust(wspace=0.3, bottom=0.15, top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=90,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
gif_video = animation.FuncAnimation(
        fig,
        animate,
        init_func=init,
        frames=90,
        interval=100,
    )

gif_video.save("bayes_joint_likelihood_distribution_heat_map_animation.gif", writer='imagemagick')

plt.close()

Again, we can envision a decision procedure by which we take our 2-dimensional data point that we are trying to classify and determine which class distribution has a higher likelihood of observing that data point. This still raises the question: what about this process makes it a Bayesian classifier?

2.3 Creating a Bayesian Classifier by including prior knowledge

Up until now we have mainly been dealing with what is known as the likelihood distribution, $P(X \mid Y)$, of which we have one per class. Specifically, it can be viewed as: a probability distribution over our input features, given that the class takes a particular value. In other words, we can calculate quantities such as:

$$P(X \mid Y=\text{"Big Spender"}) = P([\text{usage=300, temp diff = 4}] \mid Y=\text{"Big Spender"})$$

Or:

$$P(X \mid Y=\text{"Non big Spender"}) = P([\text{usage=800, temp diff = 9}] \mid Y=\text{"Non big Spender"})$$

And, as of now, we simply have a decision procedure that selects the $Y$ yielding the higher probability density of our input features. This makes sense, but how could we improve upon this prediction procedure? Well, as we said earlier, what we are truly interested in finding is $P(Y \mid X)$, known as the posterior probability; this is the probability that a new data point belongs to a certain class, given its observed features. As defined earlier, we know that this quantity is given by Bayes' rule:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

We can see that we will still be making use of the likelihood distribution, $P(X \mid Y)$, as well as two new quantities: the prior probability, $P(Y)$, and the probability of our data across all classes, $P(X)$.

The prior probability can be defined as the probability of being a big spender (or non big spender), based on no observed data. In other words, we can treat this as a simple frequency. If historically 20% of customers are big spenders, then we can write:

$$P(Y = \text{Big spender}) = 0.2$$

$$P(Y = \text{Non Big spender}) = 0.8$$
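Computing these priors from data is just counting. The label array below is hypothetical, constructed to match the "20% of customers are big spenders" figure from the text:

```python
import numpy as np

# Hypothetical historical labels: 20 big spenders out of 100 customers.
labels = np.array(["big"] * 20 + ["low"] * 80)

# Priors as empirical class frequencies
p_big = np.mean(labels == "big")   # P(Y = big spender)
p_low = np.mean(labels == "low")   # P(Y = non big spender)
```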

Our probability of data can be determined via marginalizing the joint probability of $X$ and $Y$:

$$P(XY) = P(X \mid Y) P(Y)$$

$$P(X) = \sum_Y P(XY)$$

Where we are essentially determining the probability of observing a specific data point across all hypotheses. For example, if we had a data point where the usage was 200 kWh and the temperature difference 2 degrees F, we could calculate $P(X)$ by finding the probability of observing a usage of 200 and temperature difference of 2 given the class is big spender, and then adding that to the probability of observing a usage of 200 and temperature difference of 2 given the class is non big spender. Of course, this will be performed for every data point we classify. In the continuous case we would be utilizing integration instead of summation:

$$P(X) = \int P(X, Y) \, dY$$

In our case, we can find $P(X)$ by expanding it via the product rule:

$$P(X) = P(X \mid Y=\text{Big spender})P(Y = \text{Big spender}) + P(X \mid Y=\text{Non big spender})P(Y = \text{Non big spender})$$
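That expansion is a one-liner once we have the likelihood distributions. The sketch below uses the scaled-down parameters from the plotting code above (usage divided by 100) and the 0.2 / 0.8 priors:

```python
from scipy.stats import multivariate_normal

# Data point: usage = 200 kWh (scaled to 2), temp diff = 2 F
x = [2.0, 2.0]

# Class-conditional likelihoods P(X|Y) under each fitted Gaussian
lik_big = multivariate_normal([20, 7], [[8, 0], [0, 5]]).pdf(x)
lik_low = multivariate_normal([2, 2], [[1, 0], [0, 5]]).pdf(x)

# Evidence P(X) via the law of total probability
p_x = lik_big * 0.2 + lik_low * 0.8
```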

And at this point we have everything we need in order to create a full Bayes Classifier! We can walk through our earlier example to make sure this is clear. We know the following:

$$X = [\text{usage = 200, temp diff = 2}]$$

$$\text{Big Spender Likelihood Parameters} \longrightarrow \mu_{usage} = 2000, \; \mu_{temp} = 7, \; \sigma_{usage} = 800, \; \sigma_{temp} = 5$$

$$\text{Non Big Spender Likelihood Parameters} \longrightarrow \mu_{usage} = 200, \; \mu_{temp} = 2, \; \sigma_{usage} = 100, \; \sigma_{temp} = 1$$

$$P(Y = \text{Big spender}) = 0.2$$

$$P(Y = \text{Non Big spender}) = 0.8$$

And we can calculate our posterior distributions as follows; starting with the posterior for big spender:

$$P(Y = \text{Big Spender} \mid X) = \frac{P(Y = \text{Big Spender}) \, P(X \mid Y=\text{Big Spender})}{P(X)}$$

$$P(Y = \text{Big Spender} \mid X) = \frac{0.2 \cdot N\Big(X, [2000, 7], [800, 5] \Big)} {N\Big(X, [2000, 7], [800, 5] \Big) \cdot 0.2 + N\Big(X, [200, 2], [100, 1] \Big) \cdot 0.8}$$

Where our likelihoods can be calculated via (note that we use the scaled-down parameters from the plotting code above, i.e. usage divided by 100):

big_usage_mean = 20
big_usage_std = 8
big_temp_diff_mean = 7
big_temp_diff_std = 5

low_usage_mean = 2
low_usage_std = 1
low_temp_diff_mean = 2
low_temp_diff_std = 5

likelihood_non_big_spender = multivariate_normal(
    [low_usage_mean, low_temp_diff_mean], 
    [
        [low_usage_std, 0], 
        [0, low_temp_diff_std]
    ]
)
likelihood_big_spender = multivariate_normal(
    [big_usage_mean, big_temp_diff_mean], 
    [
        [big_usage_std, 0], 
        [0, big_temp_diff_std]
    ]
)

likelihood_non_big_spender.pdf(np.array([2., 2.])) 
=> 0.07117625434171772

likelihood_big_spender.pdf([2., 2.])
=> 3.3158179074871642e-12

Leaving us with a final value of:

$$P(Y = \text{Big Spender} \mid X) = \frac{0.2 \cdot 3.315 \cdot 10^{-12}} {3.315 \cdot 10^{-12} \cdot 0.2 + 0.07118 \cdot 0.8} = 1.165 \cdot 10^{-11}$$

And the same calculation can be performed for the non big spender class:

$$P(Y = \text{Non big spender} \mid X) = \frac{P(Y = \text{Non big spender}) \, P(X \mid Y=\text{Non big spender})}{P(X)}$$

$$P(Y = \text{Non big spender} \mid X) = \frac{0.8 \cdot N\Big(X, [200, 2], [100, 1] \Big)} {N\Big(X, [2000, 7], [800, 5] \Big) \cdot 0.2 + N\Big(X, [200, 2], [100, 1] \Big) \cdot 0.8}$$

$$P(Y = \text{Non big spender} \mid X) = \frac{0.8 \cdot 0.07118} {3.315 \cdot 10^{-12} \cdot 0.2 + 0.07118 \cdot 0.8} = 0.999$$

Now, of course, in this case we will predict that the data point was generated by (belongs to) the non big spender class. There will obviously be many scenarios where the prediction is not quite as one-sided.
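The entire walkthrough above can be wrapped into one small classifier. As before, this sketch uses the scaled-down parameters (usage divided by 100) and the 0.2 / 0.8 priors:

```python
from scipy.stats import multivariate_normal

# One (likelihood distribution, prior) pair per class
classes = {
    "big spender": (multivariate_normal([20, 7], [[8, 0], [0, 5]]), 0.2),
    "non big spender": (multivariate_normal([2, 2], [[1, 0], [0, 5]]), 0.8),
}

def posteriors(x):
    # Numerators of Bayes' rule: P(X|Y) * P(Y) for each class
    joint = {c: dist.pdf(x) * prior for c, (dist, prior) in classes.items()}
    # Evidence P(X), via the law of total probability
    evidence = sum(joint.values())
    return {c: p / evidence for c, p in joint.items()}

post = posteriors([2.0, 2.0])          # usage = 200 kWh (scaled), temp diff = 2 F
prediction = max(post, key=post.get)   # class with the highest posterior
```

Running this on the worked example reproduces the lopsided posteriors above: essentially all of the probability mass lands on the non big spender class.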


Appendix: References

  1. Introduction to Statistical Learning, Chapter 3
  2. Probability Theory: The Logic of Science, Chapter 2

© 2018 Nathaniel Dake